Considerable research has been done on image saliency, but far less on video saliency. To increase
precision and accuracy during compression while reducing coding complexity, time consumption and
memory allocation problems, we propose a modified high efficiency video coding (HEVC) pixel-based
consistent spatiotemporal diffusion with temporal uniformity. The method splits the video into groups
of frames, computes colour saliency, integrates temporal fusion and conducts pixel saliency fusion;
colour information then guides the diffusion process for the spatiotemporal mapping with the help of a
permutation matrix. The proposed solution is tested on a publicly available extensive dataset with five
global saliency evaluation metrics and is compared with several other state-of-the-art saliency detection
methods. The results display an overall best performance amongst all candidates.

Keywords: Compression pixel; Computing colour saliency; Image saliency; Spatiotemporal diffusion; Video saliency

1. INTRODUCTION

Researchers have long tried to imitate the functioning of the human eye and the brain. The brain can
distinguish between the important and unimportant features of the scene the eyes are seeing and takes in
only what is necessary. Various researchers have imitated this process, and in today's world its results
appear in conference videos, broadcasting and streaming. Much research has been done on image saliency,
but far less on video saliency, and only a few works have made a significant impact in this field. Itti's
model [1] is one of the most researched and most prominent models for image saliency. Other methods apply
the Fourier transform with the help of the phase spectrum or detect image saliency using frequency tuning
[2], [3]. Further methods use the principles of inhibition of return and winner-take-all, which are
inspired by the visual nervous system [4], [5].

Video saliency detection is more difficult because the images are not still, which increases memory
allocation and computational complexity. The video saliency detection methodology in [6] involves
determining the position of an object with reference to another. Other works compute a space-time
saliency map as well as a motion saliency map [7]-[10]. Static and dynamic saliency maps are fused in
[11] to obtain a space-time saliency detection model, and a dynamic texture model is employed in [12] to
obtain motion patterns for both stationary and dynamic scenes.

Some works use a fusion model, but it results in low-level saliency [13]-[15]. Others use global
temporal clues to forge a robust low-level saliency map [16], [17]. The disadvantage of these
methodologies is that the accumulation of error is quite high, which has led to several wrong detections.

The proposed solution is a modified spatiotemporal fusion saliency detection method. It involves a
spatiotemporal background to obtain high saliency values around the foreground objects. Then, after
ignoring the hollow effects, a series of adjustments is made to the general saliency strategies to
increase the efficiency of both motion and colour saliencies. The use of cross-frame super-pixels and
one-to-one spatial-temporal fusion helps increase overall accuracy and precision during compression.


2. RELATED WORK 

This section mentions some of the research papers that helped in completing the proposed algorithm. The
survey in [18] discusses various video saliency methodologies along with their advantages and
disadvantages. Borji [19] follows the same outline but also covers the various aspects that make it
difficult for algorithms to imitate human eye-brain coordination, and how to overcome them.

The work in [20] makes a notable contribution to this field of research. It introduces a database named
dynamic human fixation 1K (DHF1K), which helps in pointing out fixations during dynamic-scene free
viewing, and the attentive convolutional neural network-long short-term memory network (ACLNet), which
augments the original convolutional neural network and long short-term memory (CNN-LSTM) model to enable
fast end-to-end saliency learning. The authors of [21], [22] have made some corrections to the smooth
pursuits (SP) logic. Their approach involves manual annotation of the SPs with fixation along the
arithmetic points and SP salient locations by training slicing convolutional neural networks.

High efficiency video coding (HEVC) has become the new standard video compression algorithm used today.
Modifying the HEVC algorithm with a spatial saliency algorithm that uses the concept of a motion vector
[23] has led to better compression and efficiency. A salient object segmentation that combines a
conditional random field (CRF) with a saliency measure, using a statistical framework and local colour
contrast, motion and illumination features, is introduced in [24]. Fang et al. [25] also use
spatiotemporal fusion with uncertainty in statistics to measure visual saliency. A geodesic robustness
methodology is used to obtain the saliency map in [26], [27]. The works in [28]-[30] have been a great
help to our solution with their super-pixel usage and adaptive colour quantization; their measurement of
the difference between spatial distance and histograms has helped to obtain the super-pixel saliency map.
The works in [31], [32] gave us an overall idea of the various evaluation metrics to be used in this
paper. The first section gives the introduction and section 2 follows it with the related work [33].
Sections 3 and 4 present the proposed algorithm, its methodologies and modifications along with its final
experimentation and comparison. Section 5 concludes the paper.


3. PROPOSED SYSTEM 
3.1. Modeling based saliency adjustment 

The robustness is obtained by combining long-term inter-batch information with colour contrast
computation. Background and foreground appearance models are represented by $B_M \in \mathbb{R}^{3 \times bn}$
and $F_M \in \mathbb{R}^{3 \times fn}$, with $bn$ and $fn$ being their sizes respectively. The $i$-th
super-pixel's RGB history in all regions is taken care of with the following equations:

$intraC_i = \exp(\lambda - |\varphi(MC_i) - \varphi(CM_i)|)$; $\lambda = 0.5$

$interC_i = \varphi\left( \frac{\min\|(R_i, G_i, B_i), B_M\|_2 \cdot \frac{1}{bn} \sum \|(R_i, G_i, B_i), B_M\|_2}{\min\|(R_i, G_i, B_i), F_M\|_2 \cdot \frac{1}{fn} \sum \|(R_i, G_i, B_i), F_M\|_2} \right)$

Here, $\lambda$ is the upper bound discrepancy degree and helps in inverting the penalty between the
motion and colour saliencies.
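
As an illustration of the two adjustment terms in this subsection, the following Python sketch evaluates an intra-frame consistency term and an inter-batch colour ratio for a single super-pixel. It is a minimal sketch under stated assumptions: the mapping applied to the motion and colour saliencies is taken to be the identity, the appearance models are plain lists of RGB triples, and the function names `intra_c` and `inter_c` are hypothetical.

```python
import math


def intra_c(mc_i, cm_i, lam=0.5):
    """Consistency term exp(lambda - |MC_i - CM_i|).

    Assumption: the mapping applied to MC_i and CM_i is the identity.
    """
    return math.exp(lam - abs(mc_i - cm_i))


def inter_c(rgb_i, background, foreground, eps=1e-12):
    """Ratio of background colour distances to foreground colour distances.

    For each appearance model the minimum distance is multiplied by the
    mean distance; a value above 1 means the super-pixel colour is closer
    to the foreground model than to the background model.
    """
    bg = [math.dist(rgb_i, b) for b in background]
    fg = [math.dist(rgb_i, f) for f in foreground]
    numerator = min(bg) * (sum(bg) / len(bg))
    denominator = min(fg) * (sum(fg) / len(fg))
    return numerator / (denominator + eps)  # eps avoids division by zero


# Illustrative appearance models (hypothetical values).
background_model = [(0, 0, 0), (10, 10, 10)]
foreground_model = [(200, 0, 0), (220, 10, 10)]
reddish = inter_c((210, 5, 5), background_model, foreground_model)
darkish = inter_c((5, 5, 5), background_model, foreground_model)
```

In this toy setting the reddish super-pixel receives a ratio above 1 and the dark one a ratio below 1, which is the behaviour the adjustment relies on to favour foreground-like colours.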


3.2. Contrast-based saliency mapping 

The video sequence is now divided into several short groups of frames $G_i = \{F_1, F_2, F_3, \ldots, F_n\}$.
Each frame $F_k$ (where $k$ denotes the frame number) undergoes modification using simple linear iterative
clustering with a boundary-aware smoothing method, which removes the unnecessary details. The colour and
motion gradient mapping, which helps form the spatiotemporal gradient map with the help of pixel-based
computation, is given by $SM_T = \|u_x, u_y\|_2 \odot \|\nabla(F)\|_2$, where $u_x$ and $u_y$ are the
horizontal and vertical gradients of the optical flow and $\nabla(F)$ is the colour gradient map. We then
calculate the $i$-th super-pixel's motion contrast using (1).

$MC_i = \sum_{a_j \in \psi_i} \frac{\|U_i, U_j\|_2}{\|a_i, a_j\|_2}, \quad \psi_i = \{\tau + 1 \geq \|a_i, a_j\|_2 \geq \tau\}$ (1)
Here the $l_2$ norm has been used, and $U$ and $a_i$ denote the optical flow gradient in two directions
and the position centre of the $i$-th super-pixel respectively. $\psi_i$ denotes the computational
contrast range and is calculated using the shortest Euclidean distance between the spatiotemporal map and
the $i$-th super-pixel.

$\tau = \frac{l}{\|\Lambda(SM_T)\|_0} \sum_{\tau \in \|\tau, i\| \leq l} \|\Lambda(SM_{T_\tau})\|_0$; $l = 0.5 \min\{width, height\}$, $\Lambda \rightarrow$ down sampling (2)
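
The motion contrast of a super-pixel can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes that each super-pixel is summarised by one optical-flow vector and one centre position, and that the contrast range is passed in as a distance interval through the hypothetical parameters `r_min` and `r_max` rather than being derived from the spatiotemporal gradient map.

```python
import math


def motion_contrast(flows, centres, i, r_min, r_max):
    """Sum of flow differences to super-pixel i, each divided by the
    spatial distance, over super-pixels whose centre lies in the
    contrast range [r_min, r_max] (range bounds are assumptions)."""
    total = 0.0
    for j, (flow_j, centre_j) in enumerate(zip(flows, centres)):
        if j == i:
            continue
        distance = math.dist(centres[i], centre_j)
        if distance > 0 and r_min <= distance <= r_max:
            total += math.dist(flows[i], flow_j) / distance
    return total


# Three super-pixels: one moving, two static (hypothetical values).
flows = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
centres = [(0.0, 0.0), (3.0, 4.0), (30.0, 40.0)]
mc_0 = motion_contrast(flows, centres, 0, r_min=1.0, r_max=10.0)
```

Only the second super-pixel falls inside the range of the first one, so `mc_0` is the flow difference 1 divided by the distance 5. Colour contrast follows the same pattern with RGB triples in place of flow vectors.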


Colour saliency is computed in the same way as the optical flow gradient, except that we use the red,
blue and green notations for the $i$-th super-pixel. So, the equation is

$CM_i = \sum_{a_j \in \psi_i} \frac{\|(R_i, G_i, B_i), (R_j, G_j, B_j)\|_2}{\|a_i, a_j\|_2}$

The following equation smoothens both $MC$ and $CM$, as temporal and saliency value refining is done by
spatial information integration.

$CM_{k,i} \leftarrow \frac{\sum_{\tau=k-1}^{k+1} \sum_{a_{\tau,j} \in \mu_\varphi} \exp(-\|c_{k,i}, c_{\tau,j}\|_1/\mu) \cdot CM_{\tau,j}}{\sum_{\tau=k-1}^{k+1} \sum_{a_{\tau,j} \in \mu_\varphi} \exp(-\|c_{k,i}, c_{\tau,j}\|_1/\mu)}$ (3)

Here, $c_{k,i}$ is the average RGB colour value of the $i$-th super-pixel in the $k$-th frame, while
$\mu$ controls the smoothing strength. The condition $\|a_{k,i}, a_{\tau,j}\|_2 \leq \theta$ needs to be
satisfied, and this is done using $\mu_\varphi$.
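
$\theta = \frac{1}{m \times n} \sum_{k=1}^{n} \sum_{i=1}^{m} \left\| \frac{1}{m} \sum_{i=1}^{m} F(SM_{T_{k,i}}), F(SM_{T_{k,i}}) \right\|_1$; $m, n$ = frame numbers (4)

$F(SM_{T_i}) = \begin{cases} a_i, & SM_{T_i} \leq \epsilon \times \frac{1}{m} \sum_{i=1}^{m} SM_{T_i} \\ 0, & \text{otherwise} \end{cases}$; $\epsilon$ = filter strength control (5)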
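
The temporal smoothing step can be illustrated with a short Python sketch. It assumes a colour-weighted average over the previous, current and next frame, restricted to super-pixels whose centres lie within a given spatial range; the data layout (`sal[t][j]`, `colours[t][j]`, `centres[t][j]`) and the function name `smooth_saliency` are hypothetical, and the range `theta` is supplied directly instead of being computed from the gradient map.

```python
import math


def smooth_saliency(sal, colours, centres, k, i, theta, mu):
    """Colour-weighted average of saliency around super-pixel i of frame k.

    Neighbours are taken from frames k-1..k+1 and must lie within the
    spatial range theta; weights are exp(-L1 colour distance / mu).
    """
    numerator = 0.0
    denominator = 0.0
    for t in range(max(0, k - 1), min(len(sal), k + 2)):
        for j in range(len(sal[t])):
            if math.dist(centres[k][i], centres[t][j]) > theta:
                continue
            l1 = sum(abs(p - q) for p, q in zip(colours[k][i], colours[t][j]))
            weight = math.exp(-l1 / mu)
            numerator += weight * sal[t][j]
            denominator += weight
    # The super-pixel itself always contributes weight 1, so denominator > 0.
    return numerator / denominator


# Three frames with two super-pixels each (hypothetical values).
sal = [[0.2, 1.0], [0.4, 1.0], [0.6, 1.0]]
colours = [[(10, 10, 10), (200, 200, 200)] for _ in range(3)]
centres = [[(0.0, 0.0), (10.0, 0.0)] for _ in range(3)]
smoothed = smooth_saliency(sal, colours, centres, k=1, i=0, theta=1.0, mu=10.0)
```

With the small range used here only the same super-pixel in the neighbouring frames contributes, so the result is the mean of 0.2, 0.4 and 0.6. A larger `theta` brings in more neighbours, while a smaller `mu` makes the average more selective with respect to colour.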


At each batch frame level, the $s$-th frame's smoothing rate is dynamically updated with
$(1 - \gamma)\theta_{s-1} + \gamma\theta_s \rightarrow \theta_s$; $\gamma$ = learning weight, 0.2. Now the
colour and motion saliencies are integrated to get the pixel-based saliency map $LL = CM \odot MC$. This
fused saliency map increases accuracy considerably, but the rate decreases; this is dealt with in the
next section.
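
The two operations of this step, the running update of the smoothing range with a learning weight of 0.2 and the element-wise fusion of the colour and motion maps, can be sketched as follows. The maps are represented as nested lists for simplicity, and the function names are hypothetical.

```python
def update_theta(theta_previous, theta_current, gamma=0.2):
    """Running update (1 - gamma) * previous + gamma * current."""
    return (1.0 - gamma) * theta_previous + gamma * theta_current


def fuse(colour_map, motion_map):
    """Element-wise product of the colour and motion saliency maps."""
    return [
        [c * m for c, m in zip(colour_row, motion_row)]
        for colour_row, motion_row in zip(colour_map, motion_map)
    ]


theta = update_theta(10.0, 20.0)
fused = fuse([[0.5, 1.0]], [[0.5, 0.2]])
```

Because the fusion is a product, a location stays salient only when both the colour and the motion cue agree, which explains the gain in accuracy and the accompanying loss of detections noted above.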


3.3. Accuracy boosting 
Matrix $M$ is considered as the input. It is decomposed into a sparse component $S$ and a low-rank
component $D$ with $\min_{D,S} \alpha\|S\|_1 + \|D\|_*$ subject to $M = S + D$, where the nuclear norm of
$D$ is used. This is solved with the help of robust principal component analysis (RPCA) [30] using
$S \leftarrow sign(M - D - S)\,[\,|M - D - S| - \alpha\beta\,]_+$ and $D \leftarrow V[\Sigma - \beta I]_+ U$,
$(V, \Sigma, U) \leftarrow svd(Z)$, where $svd(Z)$ denotes the singular value decomposition of the Lagrange
multiplier, and $\alpha$ and $\beta$ represent the lesser-rank and sparse threshold parameters
respectively. To reduce incorrect detections caused by the misplacement of the optical flow of
super-pixels in the foreground's region, the given region's rough foreground is located. The feature
subspace of a frame $k$ is spanned as $gI_k = \{LL_{k,1}, LL_{k,2}, \ldots, LL_{k,m}\}$, and thus for the
entire frame group we get $gB_\tau = \{gI_1, gI_2, \ldots, gI_n\}$. This way the rough foreground is
calculated as

$RF_i = \left[ \sum_{k=1}^{n} LL_{k,i} - \frac{\omega}{n \times m} \sum_{k=1}^{n} \sum_{i=1}^{m} LL_{k,i} \right]_+$

Here $\omega$ is the reliability control factor. We also get two subspaces from $LL$ and the RGB colour,
given by $SB = \{cv_1, cv_2, \ldots, cv_n\} \in \mathbb{R}^{3m \times n}$, where
$cv_i = \{vec(R_{i,1}, G_{i,1}, B_{i,1}, \ldots, R_{i,m}, G_{i,m}, B_{i,m})\}^T$, and
$SF = \{vec(LL_1), \ldots, vec(LL_n)\} \in \mathbb{R}^{m \times n}$. This helps in making a one-to-one
correspondence, followed by pixel-based saliency mapping infusion that is dissipated over the entire
group of frames. $SB$ over $SF$ causes disruptive foreground salient movements, and hence with the help
of [31]-[33] this issue was resolved with an alternate solution.
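
The shrinkage operators that RPCA-style solvers rely on can be illustrated in plain Python. This is a sketch of the standard soft-thresholding operator, not the authors' implementation: the sparse update shrinks every entry of a residual matrix towards zero, and the low-rank update applies the same shrinkage to the singular values, which are assumed here to have been computed already by an SVD routine. The function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """sign(value) * max(|value| - threshold, 0)."""
    magnitude = abs(value) - threshold
    if magnitude <= 0:
        return 0.0
    return magnitude if value > 0 else -magnitude


def shrink_matrix(residual, threshold):
    """Entry-wise soft thresholding, as used for the sparse component."""
    return [[soft_threshold(v, threshold) for v in row] for row in residual]


def shrink_singular_values(singular_values, beta):
    """Shrinkage of singular values, as used for the low-rank component.

    Assumption: the singular values come from an SVD computed elsewhere.
    """
    return [max(s - beta, 0.0) for s in singular_values]


sparse_part = shrink_matrix([[3.0, -0.5], [-2.0, 0.2]], 1.0)
kept_spectrum = shrink_singular_values([5.0, 1.0, 0.3], 1.0)
```

Small residual entries are set to zero while large ones are kept with reduced magnitude, which is what makes the sparse component sparse; zeroing the small singular values lowers the rank of the other component in the same way.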


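$\min_{M_{c,x}, S_{c,x}, \vartheta, A \odot \vartheta} \|M_c\|_* + \|M_x\|_* + \|A \odot \vartheta\|_2 + \alpha_1\|S_c\|_1 + \alpha_2\|S_x\|_1$

s.t. $M_c = D_c + S_c$, $M_x = D_x + S_x$, $M_c = SB \odot \vartheta$, $M_x = SF \odot \vartheta$,
$\vartheta = \{E_1, E_2, \ldots, E_n\}$, $E_i \in \{0,1\}^{m \times m}$, $E_i \mathbf{1} = \mathbf{1}$ (6)

where $\|\cdot\|_*$ is the nuclear norm and $A$ is the position matrix. The variables $D_c$, $D_x$
represent colour and saliency mapping, $\vartheta$ is the permutation matrix, while $S_c$, $S_x$ represent
the colour feature sparse component space and the saliency feature space. This entire equation set helps
in correcting super-pixel correspondences.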
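
How a 0/1 permutation matrix corrects super-pixel correspondences can be shown with a small Python sketch. It assumes one saliency value per super-pixel and a square matrix with exactly one 1 in every row and column; the function name `apply_permutation` and the example values are hypothetical.

```python
def apply_permutation(features, permutation):
    """Reorder per-super-pixel features with a 0/1 permutation matrix.

    Row r of the matrix selects which input super-pixel is placed at
    position r of the output.
    """
    size = len(features)
    columns = list(zip(*permutation))
    if any(sum(row) != 1 for row in permutation) or any(sum(col) != 1 for col in columns):
        raise ValueError("every row and column must contain exactly one 1")
    return [
        sum(permutation[r][c] * features[c] for c in range(size))
        for r in range(size)
    ]


# Super-pixels 0 and 1 swapped places between two frames.
frame_features = [0.1, 0.9, 0.5]
swap_first_two = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
aligned = apply_permutation(frame_features, swap_first_two)
```

After alignment the same index refers to the same physical region in every frame of the group, which is the precondition for comparing columns of the feature matrices across frames.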
3.4. Mathematical model
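Equation (6) generates a distributed version of convex problems:

$D(M_{c,x}, S_{c,x}, \vartheta, A \odot \vartheta) = \alpha_1\|S_c\|_1 + \alpha_2\|S_x\|_1 + \beta_1\|M_c\|_* + \beta_2\|M_x\|_* + \|A \odot \vartheta\|_2 + trace(Z_1^k(M_c - D_c - S_c)) + trace(Z_2^k(M_x - D_x - S_x)) + \frac{\pi_k}{2}(\|M_c - D_c - S_c\|_2^2 + \|M_x - D_x - S_x\|_2^2)$

where $Z_i$ represents the Lagrangian multiplier and $\pi_k$ denotes the steps of iterations. The
optimized solution using partial derivatives is

$S_{c,x}^{k+1} = \frac{1}{2}\|S_{c,x}^k - (M_{c,x}^k - D_{c,x}^k + Z_{1,2}^k/\pi_k)\|_2^2 + \min_{S_{c,x}^k} \alpha_{1,2}\|S_{c,x}^k\|_1/\pi_k$

$D_{c,x}^{k+1} = \frac{1}{2}\|D_{c,x}^k - (M_{c,x}^k - S_{c,x}^k + Z_{1,2}^k/\pi_k)\|_2^2 + \min_{D_{c,x}^k} \beta_{1,2}\|D_{c,x}^k\|_*/\pi_k$

$D_{c,x}$ is updated to become $D_{c,x}^{k+1} \leftarrow V\left[\Sigma - \frac{\beta_{1,2}}{\pi_k}\right]_+ U$, where
$(V, \Sigma, U) \leftarrow svd(M_{c,x}^k - S_{c,x}^k + Z_{1,2}^k/\pi_k)$.

Similarly, for $S_{c,x}$, $S_{c,x}^{k+1} \leftarrow sign\left(\frac{J}{|J|}\right)\left[J - \frac{\alpha_{1,2}}{\pi_k}\right]_+$, where $J = M_{c,x}^k - D_{c,x}^k + Z_{1,2}^k/\pi_k$.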

The value of $E$ is determined from the norm cost $L \in \mathbb{R}^{m \times m}$, which is calculated as
$l_{1,i,j}^k = \|O_{k,i} - H(V_1, j)\|_2$, $V_1 = H(SB, k) \odot E_k$ and
$l_{2,i,j}^k = \|O_{k,i} - H(V_2, j)\|_2$, $V_2 = H(SF, k) \odot E_k$. We then use an objective matrix $O$
to calculate the $k$-th of $RF_i$, and the equation is
$O_{k,i} = S_{c,x}(k,i) + D_{c,x}(k,i) - Z_{1,2}(k,i)/\pi_k$. There is a need to change $L_k$, as it is
hard to approximate the value of $\min\|A \odot \vartheta\|_2$.
$L_k = \{l_{1,1}^k + l_{2,1}^k, l_{1,2}^k + l_{2,2}^k, \ldots, l_{1,m}^k + l_{2,m}^k\} \in \mathbb{R}^{m \times m}$
for $\tau \in [k-1, k+1]$ is changed to $L_\tau$ as shown in (7).

$H(L_k, j) \leftarrow \sum_{\tau=k-1}^{k+1} \sum_{v} H(L_\tau, v) \cdot \exp(-\|c_{\tau,v}, c_{k,j}\|_1/\mu)$ (7)


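The global optimization is solved using the equations $SF^{k+1} \leftarrow SF^k \odot \vartheta$,
$SB^{k+1} \leftarrow SB^k \odot \vartheta$ and
$Z_{1,2}^{k+1} \leftarrow \pi_k(M_{c,x}^k - D_{c,x}^k - S_{c,x}^k) + Z_{1,2}^k$, where
$\pi_{k+1} \leftarrow \pi_k \times 1.05$. The alignment of the super-pixels is now given by
$gS_i = \frac{1}{n-1} \sum_{\tau=1, i \neq \tau}^{n} H(SF \odot \vartheta, \tau)$. To reduce the incorrect
detections and alignments we introduce $\widetilde{SF}$ and use (8)-(10).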
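
The multiplier update and the growing penalty of this optimization loop can be sketched in Python as follows. The sketch assumes small matrices stored as nested lists and shows only the dual step, in which the multiplier accumulates the constraint residual scaled by the current penalty and the penalty then grows by the factor 1.05; the function name `dual_update` is hypothetical, and the updates of the low-rank and sparse parts are left out.

```python
def dual_update(multiplier, observed, low_rank, sparse, penalty, growth=1.05):
    """One dual step: Z <- Z + penalty * (M - D - S), penalty <- growth * penalty."""
    new_multiplier = [
        [z + penalty * (m - d - s) for z, m, d, s in zip(z_row, m_row, d_row, s_row)]
        for z_row, m_row, d_row, s_row in zip(multiplier, observed, low_rank, sparse)
    ]
    return new_multiplier, penalty * growth


Z = [[0.0, 0.0]]
M = [[3.0, 1.0]]
D = [[1.0, 1.0]]
S = [[1.0, 0.0]]
Z_next, penalty_next = dual_update(Z, M, D, S, penalty=2.0)
```

Entries where the decomposition already reproduces the input leave the multiplier unchanged, while the growing penalty forces the remaining residual towards zero over the iterations.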


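$\widetilde{SF} \leftarrow SF \odot \vartheta$ (8)

$\widetilde{SF} \leftarrow \widetilde{SF} \cdot (\mathbf{1}^{m \times n} - X(S_c)) + \rho \cdot \widetilde{SF} \cdot X(S_c)$ (9)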


lun coo @& 
tiie pelea <SE, (10) 


2, otherwise 


The equation for mapping for the $i$-th video frame is given by

$gS_i = \frac{H(\rho, i) - (H(\rho, i) \cdot X(S_c))}{H(\rho, i)(n-1)} \sum_{\tau=1, i \neq \tau}^{n} H(\widetilde{SF} \odot \vartheta, \tau)$

There is a need to diffuse the inner temporal batch $x_r$ of the current group's frames based on the
degree of colour similarity. The final output is given by

$gS_{i,j} = \frac{x_r \cdot y_r + \sum_{\tau \neq r} y_\tau \cdot gS_{\tau,j}}{y_r + \sum_{\tau \neq r} y_\tau}$; $y_\tau = \exp(-\|c_{r,j}, c_{\tau,j}\|_2/\mu)$

where $y_\tau$ gives the colour distance-based weights.
2 


4. RESULTS, EXPERIMENTS AND DATABASE 

The proposed solution has been compared with [34] as a base reference, as well as with the operational
block description length (OBDL) algorithm of [35], the dynamic adaptive whitening saliency (AWS-D)
algorithm of [36], the object-to-motion convolutional neural network two-layer long short-term memory
(OMCNN-2CLSTM) algorithm in [36], the attentive convolutional (ACL) algorithm [37], and the
saliency-aware video compression (SAVC) algorithm from [38] and [39]. The database used is the same as
the one in the base paper. It is a high-definition eye-tracking database, openly available on GitHub at
https://github.com/spzhubuaa/Video-based-Eye-Tracking-Dataset [40]. Ten video sequences with three
different resolutions, 1920 x 1080, 1280 x 720 and 832 x 480, were taken for experimentation. To evaluate
the performance of all the saliency methods, we employed five global evaluation metrics, namely area
under the ROC curve (AUC), similarity (SIM), correlation coefficient (CC), normalized scanpath saliency
(NSS) and Kullback-Leibler (KL) divergence.
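
For reference, four of the five metrics can be written compactly in Python using their commonly used definitions. This is a sketch under stated assumptions, not the evaluation code used for the experiments: maps are flattened into lists of non-negative values, `fixations` is a binary list marking fixated pixels, and AUC is omitted because it needs a threshold sweep.

```python
import math


def _as_distribution(values):
    """Scale a non-negative map so that its entries sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def similarity(saliency, fixation_map):
    """SIM: sum of element-wise minima of the two normalised maps."""
    p, q = _as_distribution(saliency), _as_distribution(fixation_map)
    return sum(min(a, b) for a, b in zip(p, q))


def correlation_coefficient(saliency, fixation_map):
    """CC: Pearson correlation between the two maps."""
    n = len(saliency)
    mean_s, mean_f = sum(saliency) / n, sum(fixation_map) / n
    cov = sum((s - mean_s) * (f - mean_f) for s, f in zip(saliency, fixation_map))
    var_s = sum((s - mean_s) ** 2 for s in saliency)
    var_f = sum((f - mean_f) ** 2 for f in fixation_map)
    return cov / math.sqrt(var_s * var_f)


def nss(saliency, fixations):
    """NSS: mean of the standardised saliency at fixated locations."""
    n = len(saliency)
    mean = sum(saliency) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in saliency) / n)
    hits = [(s - mean) / std for s, f in zip(saliency, fixations) if f]
    return sum(hits) / len(hits)


def kl_divergence(saliency, fixation_map, eps=1e-12):
    """KL divergence of the fixation distribution from the saliency map."""
    p, q = _as_distribution(saliency), _as_distribution(fixation_map)
    return sum(b * math.log(eps + b / (a + eps)) for a, b in zip(p, q))


example_map = [0.1, 0.2, 0.3, 0.4]
example_fixations = [0, 0, 0, 1]
nss_score = nss(example_map, example_fixations)
```

Higher values indicate better agreement for SIM, CC and NSS, whereas a lower KL divergence is better; a map compared with itself gives SIM and CC of 1 and a KL divergence of approximately 0.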

The XU algorithm is quite similar to HEVC; hence its saliency detection is better than most algorithms,
but it faces problems when the input contains complex images. Other than that, our proposed solution has
performed remarkably well and has the best compression efficiency and precision among all the algorithms
in comparison. Table 1 shows the results for the saliency algorithms that are used. Figure 1 shows the
saliency evaluation and comparison graph.


Table 1. Results for the saliency algorithms used: fixation maps, XU [40], base paper [34] and proposed algorithm
Parameter             BasketBall    FourPeople    RaceHorses
Fixation Maps
XU [40]
Base Paper [34]
Proposed algorithm


Figure 1. Saliency evaluation and comparison graph 


5. CONCLUSION 

This paper has proposed a solution called the modified spatiotemporal fusion video saliency detection
method. It involves a modified fusion calculation along with several changes to the basic HEVC code to
include colour contrast computations and to boost both motion and colour values. There is also a
spatiotemporal pixel-based coherency boost to increase temporal-scope saliency. The proposed work is
tested on the same database as that of the base paper and is compared with other state-of-the-art methods
with the help of five global evaluation metrics: AUC, SIM, CC, NSS and KL. It is concluded that the
proposed algorithm has the best performance out of all the mentioned methods, with better compression
efficiency and precision.